# Global plotting theme, carried over from the model project.
global_theme <- theme_few() + # Theme based on S. Few's "Practical Rules for Using Color in Charts"
theme(plot.title = element_text(color = "darkred")) +
theme(strip.text.x = element_text(size = 14, colour = "#202020")) +
theme(plot.margin=margin(10,30,10,30))
This analysis will explore the Wine Quality Dataset.
The winequality.names text file shows information about the data. This is summarised below.
We aim to discover relationships in the data and form models to get a better understanding of the variables of interest. Specifically, we will use classification and clustering methods to develop insights.
This data is publicly available from https://www.kaggle.com/aleixdorca/wine-quality?select=winequality.csv
wine <- read_csv("./winequality.csv")
Let’s verify the dimensions and view the types.
## tibble [6,497 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ fixed acidity : num [1:6497] 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile acidity : num [1:6497] 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric acid : num [1:6497] 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual sugar : num [1:6497] 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num [1:6497] 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free sulfur dioxide : num [1:6497] 11 25 15 17 11 13 15 15 9 17 ...
## $ total sulfur dioxide: num [1:6497] 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num [1:6497] 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num [1:6497] 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num [1:6497] 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num [1:6497] 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : num [1:6497] 5 5 5 6 5 5 5 7 7 5 ...
## $ good : num [1:6497] 0 0 0 0 0 0 0 1 1 0 ...
## $ color : chr [1:6497] "red" "red" "red" "red" ...
## - attr(*, "spec")=
## .. cols(
## .. `fixed acidity` = col_double(),
## .. `volatile acidity` = col_double(),
## .. `citric acid` = col_double(),
## .. `residual sugar` = col_double(),
## .. chlorides = col_double(),
## .. `free sulfur dioxide` = col_double(),
## .. `total sulfur dioxide` = col_double(),
## .. density = col_double(),
## .. pH = col_double(),
## .. sulphates = col_double(),
## .. alcohol = col_double(),
## .. quality = col_double(),
## .. good = col_double(),
## .. color = col_character()
## .. )
Let’s confirm that there are no NAs.
apply(wine, 2, function(x) any(is.na(x)))
## fixed acidity volatile acidity citric acid
## FALSE FALSE FALSE
## residual sugar chlorides free sulfur dioxide
## FALSE FALSE FALSE
## total sulfur dioxide density pH
## FALSE FALSE FALSE
## sulphates alcohol quality
## FALSE FALSE FALSE
## good color
## FALSE FALSE
We should coerce color to a factor. We should also discretise quality, since its values are ordinal labels ranked from \(0\) to \(10\).
wine$color <- factor(wine$color) # factor color
wine$quality.factor <- factor(wine$quality) # factor quality
str(wine)
## tibble [6,497 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ fixed acidity : num [1:6497] 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile acidity : num [1:6497] 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric acid : num [1:6497] 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual sugar : num [1:6497] 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num [1:6497] 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free sulfur dioxide : num [1:6497] 11 25 15 17 11 13 15 15 9 17 ...
## $ total sulfur dioxide: num [1:6497] 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num [1:6497] 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num [1:6497] 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num [1:6497] 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num [1:6497] 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : num [1:6497] 5 5 5 6 5 5 5 7 7 5 ...
## $ good : num [1:6497] 0 0 0 0 0 0 0 1 1 0 ...
## $ color : Factor w/ 2 levels "red","white": 1 1 1 1 1 1 1 1 1 1 ...
## $ quality.factor : Factor w/ 7 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...
## - attr(*, "spec")=
## .. cols(
## .. `fixed acidity` = col_double(),
## .. `volatile acidity` = col_double(),
## .. `citric acid` = col_double(),
## .. `residual sugar` = col_double(),
## .. chlorides = col_double(),
## .. `free sulfur dioxide` = col_double(),
## .. `total sulfur dioxide` = col_double(),
## .. density = col_double(),
## .. pH = col_double(),
## .. sulphates = col_double(),
## .. alcohol = col_double(),
## .. quality = col_double(),
## .. good = col_double(),
## .. color = col_character()
## .. )
Let us rename features for better interpretability.
wine <-
wine %>% rename(
fixed.acidity = `fixed acidity`,
volatile.acidity = `volatile acidity`,
citric.acid = `citric acid`,
residual.sugar = `residual sugar`,
free.sulfur.dioxide = `free sulfur dioxide`,
total.sulfur.dioxide = `total sulfur dioxide`
)
colnames(wine)
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
## [13] "good" "color" "quality.factor"
We should explore the distribution of wine qualities.
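The counts shown below can be produced with a one-line frequency table (assuming the data is loaded as wine):

```r
# Frequency of each quality rating across all wines
table(wine$quality)
```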
##
## 3 4 5 6 7 8 9
## 30 216 2138 2836 1079 193 5
wine %>%
ggplot(aes(quality, fill=color)) +
geom_histogram(binwidth = .5, col="black") +
facet_grid(color ~ .) +
labs(title="Wine Quality Histogram", subtitle="Wine Quality for Red and White Wine") +
global_theme
The above confirms that both red and white wine have a very similar quality distribution. Let us consider the quality distribution collectively.
wine %>%
ggplot( aes(x=quality) ) +
geom_histogram() +
labs(title="Wine Quality Histogram", subtitle="Wine Quality for ALL Wine") +
global_theme
The distribution of quality is clearly not uniform, which could be problematic for classification tasks. The data appears to follow a roughly normal distribution. Let's take a closer look.
wine %>%
ggplot( aes(x=quality) ) +
geom_histogram( aes(y=..count../sum(..count..)) ) +
ylab("proportion") +
stat_function(fun = dnorm, n = 101, args = list(mean = mean(wine$quality), sd = sd(wine$quality)), size=1.0, color="#ec6d5f") +
labs(title="Wine Quality Histogram", subtitle="Normal Distribution Fit") +
global_theme
The models we train will have much more data for mid level qualities than low and high quality wines which could potentially lead to imbalanced predictions. Let’s get more quantitative.
# GET CONFIDENCE INTERVAL OF NORMAL DISTRIBUTION
confidence_interval <- function(mu, sigma, z) {
lower <- mu - z * sigma
upper <- mu + z * sigma
return( c(lower, upper) )
}
mu <- mean(wine$quality)
sigma <- sd(wine$quality)
z <- 1.645 # 90% confidence interval
confidence_interval(mu, sigma, z)
## [1] 4.381873 7.254883
Thus, around \(90\%\) of our data has a quality of either \(5\), \(6\), or \(7\). This means that \(10\%\) of data has a quality of \(3\), \(4\), \(8\), or \(9\).
We can conclude that the testing and validation sets must be stratified samples to be representative of the original imbalanced data set for classifying quality. We can also use Synthetic Minority Oversampling Technique (SMOTE) on the training set to handle the imbalanced data problem. In short, we add synthetic instances to the minority classes to create a more uniform distribution. These synthetic instances are intelligently chosen by drawing a line between the examples in the feature space and creating a new sample at a point along that line.
https://arxiv.org/pdf/1106.1813.pdf (SMOTE paper, 2002)
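The interpolation idea behind SMOTE can be sketched in a few lines of R (a toy illustration of a single synthetic sample, not the library implementation used later; the feature values here are hypothetical):

```r
set.seed(1)
x    <- c(7.4, 0.70, 0.00)     # a minority-class example (hypothetical values)
x_nn <- c(7.8, 0.88, 0.04)     # one of its k nearest minority-class neighbours
gap  <- runif(1)               # random position along the segment joining them
x_new <- x + gap * (x_nn - x)  # the synthetic sample lies on that line
x_new
```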
We will have to address these points when preparing the data for classification models of quality.
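A stratified split can be sketched with base R by sampling within each quality level (a sketch only; the split actually used later in this analysis is a simple random one):

```r
set.seed(4009)
# Hold out ~10% of rows within each quality level so the test set keeps the class mix
test_idx <- unlist(lapply(
  split(seq_len(nrow(wine)), wine$quality),
  function(idx) sample(idx, size = max(1, round(0.1 * length(idx))))
))
dTestStrat  <- wine[test_idx, ]
dTrainStrat <- wine[-test_idx, ]
round(prop.table(table(dTestStrat$quality)), 2)  # should mirror the full data's mix
```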
Let us firstly identify any correlations between feature variables. We will compute the Pearson’s coefficient.
num.vars <- setdiff( names(wine), c("color", "quality.factor") ) # correlation matrix only accepts class num
corr <- round( cor( wine[ , num.vars ], method="pearson" ), 2 ) # Lets find correlation using cor()
corr
## fixed.acidity volatile.acidity citric.acid residual.sugar
## fixed.acidity 1.00 0.22 0.32 -0.11
## volatile.acidity 0.22 1.00 -0.38 -0.20
## citric.acid 0.32 -0.38 1.00 0.14
## residual.sugar -0.11 -0.20 0.14 1.00
## chlorides 0.30 0.38 0.04 -0.13
## free.sulfur.dioxide -0.28 -0.35 0.13 0.40
## total.sulfur.dioxide -0.33 -0.41 0.20 0.50
## density 0.46 0.27 0.10 0.55
## pH -0.25 0.26 -0.33 -0.27
## sulphates 0.30 0.23 0.06 -0.19
## alcohol -0.10 -0.04 -0.01 -0.36
## quality -0.08 -0.27 0.09 -0.04
## good -0.05 -0.15 0.05 -0.06
## chlorides free.sulfur.dioxide total.sulfur.dioxide density
## fixed.acidity 0.30 -0.28 -0.33 0.46
## volatile.acidity 0.38 -0.35 -0.41 0.27
## citric.acid 0.04 0.13 0.20 0.10
## residual.sugar -0.13 0.40 0.50 0.55
## chlorides 1.00 -0.20 -0.28 0.36
## free.sulfur.dioxide -0.20 1.00 0.72 0.03
## total.sulfur.dioxide -0.28 0.72 1.00 0.03
## density 0.36 0.03 0.03 1.00
## pH 0.04 -0.15 -0.24 0.01
## sulphates 0.40 -0.19 -0.28 0.26
## alcohol -0.26 -0.18 -0.27 -0.69
## quality -0.20 0.06 -0.04 -0.31
## good -0.16 0.01 -0.05 -0.28
## pH sulphates alcohol quality good
## fixed.acidity -0.25 0.30 -0.10 -0.08 -0.05
## volatile.acidity 0.26 0.23 -0.04 -0.27 -0.15
## citric.acid -0.33 0.06 -0.01 0.09 0.05
## residual.sugar -0.27 -0.19 -0.36 -0.04 -0.06
## chlorides 0.04 0.40 -0.26 -0.20 -0.16
## free.sulfur.dioxide -0.15 -0.19 -0.18 0.06 0.01
## total.sulfur.dioxide -0.24 -0.28 -0.27 -0.04 -0.05
## density 0.01 0.26 -0.69 -0.31 -0.28
## pH 1.00 0.19 0.12 0.02 0.03
## sulphates 0.19 1.00 0.00 0.04 0.03
## alcohol 0.12 0.00 1.00 0.44 0.39
## quality 0.02 0.04 0.44 1.00 0.76
## good 0.03 0.03 0.39 0.76 1.00
It is easier to interpret these results visually.
# Let us plot a Correlogram using above returned results.
corr %>%
ggcorrplot(
hc.order = TRUE,
type = "lower",
lab = TRUE,
lab_size = 3,
colors = c("#ec6d5f", "white", "#27c9b8"),
title="Correlogram of Wine Data",
ggtheme=global_theme
)
A summary of the findings is shown below:
Let us look closely into the highly correlated attributes as below. We will also separate by color to start identifying trends between red and white wine.
plot1 <- wine %>%
ggplot(aes(x=quality.factor, y=alcohol) ) +
geom_boxplot( aes(fill=color) ) +
labs(title="Alcohol vs Quality") +
xlab("quality") +
global_theme
plot2 <- wine %>%
ggplot(aes(x=quality.factor, y=density) ) +
geom_boxplot( aes(fill=color) ) +
labs(title="Density vs Quality") +
xlab("quality") +
global_theme
plot3 <- wine %>%
ggplot(aes(x=quality.factor, y=chlorides) ) +
geom_boxplot( aes(fill=color) ) +
labs(title="Chlorides vs Quality") +
xlab("quality") +
global_theme
plot4 <- wine %>%
ggplot(aes(x=quality.factor, y=volatile.acidity) ) +
geom_boxplot( aes(fill=color) ) +
labs(title="Volatile Acidity vs Quality") +
xlab("quality") +
global_theme
grid.arrange(plot1, plot2, plot3, plot4, ncol=2)
A summary of the findings is shown below:
Overall Insights
A classification task for wine quality based on wine chemistry seems difficult, as the distributions above show only slight variations between quality levels. This becomes even harder with an imbalanced data set. We should keep in mind that a model off by \(\pm 1\) quality rating may not be too bad: distinguishing high- from medium-quality wine may matter more than distinguishing a \(6\) from a \(7\).
We should come back to these points when creating our models.
Comparing Red and White Wine
Firstly, we can see that there are no quality \(9\) wines that are red. In fact, there are only \(5\) instances of this quality at all. We should remove these instances, as no classifier can learn to recognise a very high quality wine from so few observations, even with SMOTE.
df <- wine # df will hold the untouched data set if needed
wine <- wine[wine$quality != 9,] # filter out the 5 quality 9 obs.
The distributions for density, chlorides and volatile acidity seem very separated between red and white wines. A classification task for determining wine color based on its chemistry does not seem that complex as the data seems linearly separable for many feature variables. Let’s take a closer look.
dense1 <- wine %>%
ggplot( aes(x=alcohol) ) +
geom_density( aes(fill=color), alpha=0.5 ) +
ylab("") +
labs(title="Alcohol") +
global_theme
dense2 <- wine %>%
ggplot( aes(x=density) ) +
geom_density( aes(fill=color), alpha=0.5 ) +
ylab("") +
labs(title="Wine Density") +
global_theme
dense3 <- wine %>%
ggplot( aes(x=chlorides) ) +
geom_density( aes(fill=color), alpha=0.5 ) +
ylab("") +
labs(title="Chlorides") +
global_theme
dense4 <- wine %>%
ggplot( aes(x=volatile.acidity) ) +
geom_density( aes(fill=color), alpha=0.5 ) +
ylab("") +
labs(title="Volatile Acidity") +
global_theme
grid.arrange(dense1, dense2, dense3, dense4, ncol=2)
The above confirms good separation between density, chlorides and volatile acidity. Alcohol content does not divide wine color well as expected from the box plots.
The correlation matrix also shows that total sulfur dioxide is highly positively correlated with color. Let us plot total sulfur dioxide and chlorides to view any trends.
wine %>%
ggplot(aes(x=chlorides, y=total.sulfur.dioxide, color=color)) +
geom_point(size=1, alpha=0.5) +
labs(title="Total Sulfur Dioxide vs Chlorides") +
global_theme
It is evident that wine color is linearly separable between these two variables. A simple multi-variable linear model seems like it would perform very well.
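As a sanity check of the claimed separability, even a crude single-threshold rule on one variable does well here (the cutoff is picked by eye from the scatter plot and is purely illustrative):

```r
cutoff <- 100  # hypothetical split on total sulfur dioxide; reds tend to sit below it
rule   <- ifelse(wine$total.sulfur.dioxide > cutoff, "white", "red")
mean(rule == wine$color)  # accuracy of the one-variable rule
```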
That is enough exploring! Let us prepare our data for training.
The first step is to divide our data into a training, calibration and test set for modeling.
# PREPARE DATASET
d <- wine
# CREATE TRAIN & TEST SET
set.seed(4009)
d$rgroup <- runif( nrow(d) )
dTrainAll <- d %>% subset(rgroup <= 0.9) %>% as.data.frame # 90% train
dTest <- d %>% subset(rgroup > 0.9) %>% as.data.frame # 10% test
# ISOLATE FEATURE VARIABLES
outcomes <- c("color")
vars <- setdiff(colnames(d), c(outcomes, "rgroup") )
# SEPARATE CATEGORICAL AND NUMERICAL FEATURE VARIABLES
catVars <- vars[ sapply(dTrainAll[, vars], class) %in% c("factor", "character") ] # should be empty
numericVars <- vars[ sapply(dTrainAll[, vars], class) %in% c("numeric", "integer") ]
# CLEAN MEMORY
rm(list = c("d"))
Let us now set our response variable to color, with red as the positive outcome. We shall also create a calibration set to allow cross-validation during the single-variable analysis.
# SET RESPONSE AND POSITIVE OUTCOME VARS
outcome <- "color"
pos <- "red"
# CREATE CALIBRATION SET (10% - TRAIN)
useForCal <- rbinom( n=nrow(dTrainAll), size = 1, prob = 0.1 ) > 0
dCal <- subset(dTrainAll, useForCal)
dTrain <- subset(dTrainAll, !useForCal)
Let us find the majority class classifier as our null model in order to identify a good baseline.
col.tab <- prop.table(table(wine$color))
red <- round(col.tab[1], 4)
white <- round(col.tab[2], 4)
ggplot(wine, aes(x = color, fill = color)) +
geom_bar(aes(y = (..count..) / sum(..count..)), width = 0.75, show.legend = FALSE) +
scale_y_continuous(labels = percent_format()) +
annotate("text", x = 'red', y = red + 0.02, label = paste0(red * 100, "%"), size=rel(6.0)) +
annotate("text", x = 'white', y = white + 0.02, label = paste0(white * 100, "%"), size=rel(6.0)) +
labs(title="Wine Color Frequencies") +
ylab("Relative Frequencies") +
global_theme
It is evident that \(75.37\%\) of the wine in our data set is white. It follows that a majority-class null model will achieve an accuracy of \(75.37\%\).
The unequal distribution of the color variable can cause a classifier's performance to be biased towards the majority class (white). Much research shows that classifiers perform better on balanced data sets than on imbalanced ones; Sun, Wong & Kamel (2011), "Classification of imbalanced data: a review", is one interesting paper.
We will use the SMOTE (Synthetic Minority Oversampling Technique) oversampling method to smartly produce synthetic instances to balance the data set. We will only use SMOTE for the training data as we want the test and calibration sets to reflect real observations when measuring model accuracies.
# SMOTE - Synthetic Minority Oversampling Technique
balancedTrain <- SMOTE(color ~., dTrain, perc.over = 100, k = 5, perc.under = 200)
Let’s verify that the synthetic sampling worked.
col.tab <- prop.table(table(balancedTrain$color))
red <- round(col.tab[1], 4)
white <- round(col.tab[2], 4)
ggplot(balancedTrain, aes(x = color, fill = color)) +
geom_bar(aes(y = (..count..) / sum(..count..)), width = 0.75, show.legend = FALSE) +
scale_y_continuous(labels = percent_format()) +
annotate("text", x = 'red', y = red + 0.02, label = paste0(red * 100, "%"), size=rel(6.0)) +
annotate("text", x = 'white', y = white + 0.02, label = paste0(white * 100, "%"), size=rel(6.0)) +
labs(title="SMOTE Wine Color Frequencies") +
ylab("Relative Frequencies") +
global_theme
Now that we have applied SMOTE to our training data set, we can start creating some models.
We should begin with finding the best single-variable model to identify which variables have a significant relationship with the response variable (color). We can then use this information to help build our multi-variable models.
Let’s build two functions as we will need to deal with numerical and categorical variables separately. These functions will help make predictions and store the results in a new column of the dataframe named with the prefix ‘pred’.
# MAKE PREDICTIONS ON CATEGORICAL VARIABLES
mkPredC <- function(outCol, varCol, appCol) {
pPos <- sum(outCol==pos)/length(outCol)
naTab <- table(as.factor(outCol[is.na(varCol)]))
pPosWna <- (naTab/sum(naTab))[pos]
vTab <- table(as.factor(outCol),varCol)
pPosWv <- (vTab[pos,]+1.0e-3*pPos)/(colSums(vTab)+1.0e-3)
pred <- pPosWv[appCol]
pred[is.na(appCol)] <- pPosWna
pred[is.na(pred)] <- pPos
pred
}
# DISCRETISATION - BIN NUMERIC FEATURES INTO CATEGORICAL RANGES AND THEN CALL mkPredC TO MAKE PREDICTIONS
mkPredN <- function(outCol, varCol, appCol) {
# DISCRETISE THE CONTINUOUS VARIABLE
cuts <- unique( as.numeric( quantile(varCol, probs=seq(0, 1, 0.1), na.rm=T) ) ) # equal frequency binning
varC <- cut(varCol, cuts)
appC <- cut(appCol, cuts)
# MAKE PREDICTION AS FOR CATEGORICAL
mkPredC(outCol, varC, appC)
}
We will also create an AUC function to calculate the area under the ROC curve to help compare performances.
calcAUC <- function(predcol, outcol) {
perf <- performance(prediction(predcol, outcol == pos), 'auc')
as.numeric(perf@y.values)
}
We now loop through each categorical and numerical variable, applying the appropriate prediction function together with the AUC function. If the training AUC is above the specified threshold, we display it alongside the calibration AUC for comparison. When the training and calibration AUCs are similar, it indicates that the training data is representative of the full data set and the model is likely to generalise well to the test data. When the training AUC is much greater than the calibration AUC, the model is overfitting: it is too sensitive to small variations in the training data and generalises poorly to new data.
thresh = 0.8
# CATEGORICAL PREDICTIONS
for(var in catVars) {
pred <- paste0("pred", var)
balancedTrain[, pred] <- mkPredC(balancedTrain[, outcome], balancedTrain[, var], balancedTrain[, var])
dCal[, pred] <- mkPredC(balancedTrain[, outcome], balancedTrain[, var], dCal[, var])
dTest[, pred] <- mkPredC(balancedTrain[, outcome], balancedTrain[, var], dTest[, var])
aucTrain <- calcAUC(balancedTrain[, pred], balancedTrain[, outcome])
if(aucTrain >= thresh) {
aucCal <- calcAUC(dCal[, pred], dCal[, outcome])
print(sprintf("%s, trainAUC: %4.3f calibrationAUC: %4.3f", pred, aucTrain, aucCal))
}
}
# NUMERICAL PREDICTIONS
for(var in numericVars) {
pred <- paste0("pred", var)
balancedTrain[, pred] <- mkPredN(balancedTrain[, outcome], balancedTrain[, var], balancedTrain[, var])
dTest[, pred] <- mkPredN(balancedTrain[, outcome], balancedTrain[, var], dTest[, var])
dCal[, pred] <- mkPredN(balancedTrain[, outcome], balancedTrain[, var], dCal[, var])
aucTrain <- calcAUC(balancedTrain[, pred], balancedTrain[, outcome])
if(aucTrain >= thresh) {
aucCal <- calcAUC(dCal[, pred], dCal[, outcome])
print(sprintf("%s, trainAUC: %4.3f calibrationAUC: %4.3f", pred, aucTrain, aucCal))
}
}
## [1] "predvolatile.acidity, trainAUC: 0.910 calibrationAUC: 0.885"
## [1] "predresidual.sugar, trainAUC: 0.886 calibrationAUC: 0.883"
## [1] "predchlorides, trainAUC: 0.961 calibrationAUC: 0.969"
## [1] "predfree.sulfur.dioxide, trainAUC: 0.850 calibrationAUC: 0.845"
## [1] "predtotal.sulfur.dioxide, trainAUC: 0.955 calibrationAUC: 0.962"
## [1] "preddensity, trainAUC: 0.808 calibrationAUC: 0.774"
## [1] "predsulphates, trainAUC: 0.836 calibrationAUC: 0.814"
The above shows the feature variables with an AUC above the specified threshold. The similarity between training and calibration AUCs indicates that the training data is representative of the full data set. We should now use cross-validation to confirm that our training data is appropriate. The function below repeatedly draws new training and calibration sets and reruns the prediction function; this is repeated for a number of folds (e.g. 100), and the mean and standard deviation of the AUCs are returned.
This checks that we did not simply get a lucky calibration set. It is also important to realise that the AUC calculations above use synthetic samples from SMOTE, whereas the cross-validation below does not.
# K-FOLD CROSS VALIDATION
fCross <- function(var) {
useForCalRep <- rbinom(n = nrow(dTrainAll), size = 1, prob = 0.1 ) > 0
predRep <- mkPredN(dTrainAll[!useForCalRep, outcome],
dTrainAll[!useForCalRep, var],
dTrainAll[useForCalRep, var])
calcAUC(predRep, dTrainAll[useForCalRep, outcome])
}
k <- 100
aucs <- replicate(k, fCross("total.sulfur.dioxide"))
print(sprintf("cross-validation mean: %4.3f, cross-validation standard deviation: %4.3f", mean(aucs), sd(aucs)))
## [1] "cross-validation mean: 0.947, cross-validation standard deviation: 0.010"
The total.sulfur.dioxide variable gives a cross-validation mean of \(0.947\), which is very close to the training AUC of \(0.955\) reported above. A small standard deviation verifies that the cross-validation AUC does not vary much. Running this function with the other variables that met our threshold gives similar results, which confirms that our training set is ready for further modeling.
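The same check can be run over every variable that met the threshold in a single pass (a sketch reusing the fCross function defined above; all of these variables are numeric, which fCross expects):

```r
vars_to_check <- c("volatile.acidity", "residual.sugar", "chlorides",
                   "free.sulfur.dioxide", "total.sulfur.dioxide",
                   "density", "sulphates")
# Mean and standard deviation of 100 cross-validation AUCs per variable
sapply(vars_to_check, function(v) {
  aucs <- replicate(100, fCross(v))
  c(mean = round(mean(aucs), 3), sd = round(sd(aucs), 3))
})
```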
We have a very high performing single-variable model. Now it is time to see if we can improve by taking advantage of multiple feature variables to make predictions. One model we could train is a K Nearest Neighbour classifier. KNN is expensive both in time and space. Sometimes we can get similar results with more efficient methods such as logistic regression. Logistic regression seems like a good model as we have a relatively small number of feature variables and categorical variables have a small number of levels.
Before training any models, it may be wise to do some feature selection, considering that the winequality.names text file states that "not all input variables are necessarily relevant". We will perform feature selection by comparing log-likelihoods, as shown below.
We will first get the base rate of the NULL model.
# Define a convenience function to compute log likelihood.
logLikelihood <- function(outCol, predCol) {
sum(ifelse(outCol==pos,log(predCol),log(1-predCol)))
}
# Compute the base rate of a NULL model
baseRateCheck <-
logLikelihood(dCal[,outcome], sum(dCal[,outcome]==pos)/length(dCal[,outcome]))
baseRateCheck
## [1] -310.4419
We shall now loop through all the variables and select only the most influential features.
selVars <- c()
minStep <- 5
# CATEGORICAL FEATURE SELECTION
for(v in catVars) {
pi <- paste('pred', v, sep='')
liCheck <- 2*((logLikelihood(dCal[,outcome], dCal[,pi]) - 1 - baseRateCheck))
if(liCheck > minStep) {
print(sprintf("%s, calibrationScore: %g",pi ,liCheck))
selVars <- c(selVars, v)
}
}
# NUMERIC FEATURE SELECTION
for(v in numericVars) {
pi <- paste('pred', v, sep='')
liCheck <- 2*((logLikelihood(dCal[,outcome], dCal[,pi]) - 1 - baseRateCheck))
if(liCheck >= minStep) {
print(sprintf("%s, calibrationScore: %g", pi, liCheck))
selVars <- c(selVars, v)
}
}
## [1] "predfixed.acidity, calibrationScore: 32.0468"
## [1] "predresidual.sugar, calibrationScore: 159.033"
## [1] "predchlorides, calibrationScore: 379.684"
## [1] "predfree.sulfur.dioxide, calibrationScore: 72.9201"
## [1] "predtotal.sulfur.dioxide, calibrationScore: 352.099"
## [1] "preddensity, calibrationScore: 7.81743"
## [1] "predsulphates, calibrationScore: 45.4651"
The important features are:
selVars
## [1] "fixed.acidity" "residual.sugar" "chlorides"
## [4] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [7] "sulphates"
We should also prepare our data to avoid feeding the model duplicate information, which causes multicollinearity. We manage this by removing variables that are too highly correlated with each other.
By referencing the correlogram in the exploration section, the highly correlated pairs are:
total.sulfur.dioxide and free.sulfur.dioxide
residual.sugar and density
chlorides and density
sulphates and density
We will keep total.sulfur.dioxide and density and remove the others. Note that we could instead have kept total.sulfur.dioxide, residual.sugar, chlorides and sulphates; however, we should prefer the option that results in the lowest number of dimensions.
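As an alternative to eyeballing the correlogram, this pruning can be automated (a sketch assuming the caret package is available; findCorrelation greedily flags, within each highly correlated pair, the variable with the higher mean absolute correlation, so it may keep a different subset than our manual choice):

```r
library(caret)  # assumed to be installed
# Flag variables whose pairwise |correlation| exceeds 0.7 among the selected features
to_drop <- findCorrelation(cor(wine[, selVars]), cutoff = 0.7, names = TRUE)
to_drop
```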
drop <- c('fixed.acidity', 'residual.sugar', 'free.sulfur.dioxide', 'chlorides', 'sulphates')
selVars <- selVars[!selVars %in% drop]
selVars
## [1] "total.sulfur.dioxide" "density"
Now we can create our binary logistic regressor.
f <- paste(outcome, ' == "red" ~ ', paste(selVars, collapse=' + '), sep='')
gmodel <- glm(as.formula(f),data=dTrain, family=binomial(link='logit'))
log.pred <- c(
calcAUC(predict(gmodel, newdata = dTrain), dTrain[, outcome]),
calcAUC(predict(gmodel, newdata = dCal), dCal[, outcome]),
calcAUC(predict(gmodel, newdata = dTest), dTest[, outcome]))
log.pred
## [1] 0.9822171 0.9940508 0.9810298
We can see that our model performs similarly on the three datasets meaning that it generalises well. The AUCs are also very high.
Let’s plot the logistic curve to try and visualise the performance.
predict.data <- data.frame(probability.of.color = gmodel$fitted.values, color = dTrain$color)
predict.data <- predict.data[ order(predict.data$probability.of.color, decreasing = FALSE), ]
predict.data$rank <- 1:nrow(predict.data)
predict.data %>%
ggplot( aes(x=rank, y=probability.of.color)) +
geom_point( aes(color=color), alpha=1, shape=4, stroke=2 ) +
xlab("index") +
ylab("Pred Probability of Red") +
labs(title="Logistic Regression Curve") +
global_theme
We can see that most of the red wines are predicted to have a relatively high probability and most white wines are predicted to have a relatively low probability. Thus, it seems that we have a good model but let us cross-validate to make sure.
# K-FOLD CROSS VALIDATION
fCrossLogit <- function() {
useForCalRep <- rbinom(n = nrow(dTrainAll), size = 1, prob = 0.1 ) > 0
predRep <- predict(gmodel, newdata = dTrainAll[useForCalRep, ])
calcAUC(predRep, dTrainAll[useForCalRep, outcome])
}
k <- 100
aucs <- replicate(k, fCrossLogit())
print(sprintf("cross-validation mean: %4.3f, cross-validation standard deviation: %4.3f", mean(aucs), sd(aucs)))
## [1] "cross-validation mean: 0.984, cross-validation standard deviation: 0.006"
Again, we have a cross-validation mean that is very close to the AUC calculations above. A small standard deviation verifies that the cross-validation AUC does not vary much. Viewing the logistic regression curve and cross-validating should be enough to justify that we have a good model here. We will use more sophisticated analysis techniques when building a multi-class classifier for wine quality.
We will now explore a decision tree model for classification. Specifically, we will build three models.
# MODEL ONE
fmla1 <- paste0(outcome,"== 'red' ~ ", paste(c(catVars, numericVars), collapse = " + "))
tmodel1 <- rpart(fmla1, data = balancedTrain)
decision.tree.1 <- c(
calcAUC(predict(tmodel1, newdata = balancedTrain), balancedTrain[, outcome]),
calcAUC(predict(tmodel1, newdata = dCal), dCal[, outcome]),
calcAUC(predict(tmodel1, newdata = dTest), dTest[, outcome])
)
decision.tree.1
## [1] 0.9783275 0.9895650 0.9773116
# MODEL TWO
prefixed.vars <- paste0('pred', c(catVars, numericVars))
fmla2 <- paste0(outcome,"== 'red' ~ ", paste(prefixed.vars, collapse = " + "))
tmodel2 <- rpart(fmla2, data = balancedTrain)
decision.tree.2 <- c(
calcAUC(predict(tmodel2, newdata = balancedTrain), balancedTrain[, outcome]),
calcAUC(predict(tmodel2, newdata = dCal), dCal[, outcome]),
calcAUC(predict(tmodel2, newdata = dTest), dTest[, outcome]))
decision.tree.2
## [1] 0.9824090 0.9915335 0.9853474
# MODEL THREE
tmodel3 <-
rpart(
fmla1, data = balancedTrain,
control = rpart.control(cp = 0.001, minsplit = 1000, minbucket = 1000, maxdepth = 5)
)
decision.tree.3 <- c(
calcAUC(predict(tmodel3, newdata = balancedTrain), balancedTrain[, outcome]),
calcAUC(predict(tmodel3, newdata = dCal), dCal[, outcome]),
calcAUC(predict(tmodel3, newdata = dTest), dTest[, outcome]))
decision.tree.3
## [1] 0.9661766 0.9790952 0.9641451
# DISPLAY RESULTS
dt.matrix <- matrix(
c(decision.tree.1, decision.tree.2, decision.tree.3),
nrow = 3, ncol = 3, byrow = FALSE,
dimnames = list(c("Train", "Cal", "Test"),
c("decision tree 1", "decision tree 2", "decision tree 3"))
)
dt.matrix
## decision tree 1 decision tree 2 decision tree 3
## Train 0.9783275 0.9824090 0.9661766
## Cal 0.9895650 0.9915335 0.9790952
## Test 0.9773116 0.9853474 0.9641451
It is evident that all our models perform well on the training set as well as generalising on the unseen data. The decision tree trained using the pred-prefix values (decision tree 2) performs best. Let’s try and explain these results.
First, let’s print the rules so that we can clearly see the quantitative structure.
rpart.rules(tmodel2, cover = TRUE, style = "tall")
Now, let’s visualise the hierarchical structure of our decision tree.
# DECISION TREE - MODEL 2 (COLOR == RED)
rpart.plot(tmodel2, box.palette="Reds")
For ease of explainability, small fitted values are displayed as white to indicate a white wine classification, and large fitted values are displayed as red to indicate a red wine classification. A continuous color scale covers intermediate values.
The important takeaways from this model are:
We should investigate if we can achieve similar results with only the two most important features. We will use the non pred-prefixed values for our investigation.
# FEATURE SELECTION MODEL
fmlaFS <- paste0(outcome,"== 'red' ~ ", paste(c("total.sulfur.dioxide", "chlorides"), collapse = " + "))
tmodel4 <-
rpart(
fmlaFS, data = balancedTrain,
control = rpart.control(cp = 0.001, minsplit = 1000, minbucket = 1000, maxdepth = 5)
)
decision.tree.4 <- c(
calcAUC(predict(tmodel4, newdata = balancedTrain), balancedTrain[, outcome]),
calcAUC(predict(tmodel4, newdata = dCal), dCal[, outcome]),
calcAUC(predict(tmodel4, newdata = dTest), dTest[, outcome]))
decision.tree.4
## [1] 0.9661766 0.9790952 0.9641451
Remarkably, we still get excellent results from just two feature variables: total.sulfur.dioxide and chlorides are highly predictive of wine color. However, we will stick with our best model when comparing results in the next section.
Having trained our logistic regression and three decision trees, we can now compare their performance against each other and against the null model. Since our classifiers all perform very well, it is also instructive to compare against a saturated model. Below are the calculations for a Bayes rate model.
# SELECT ONLY THE FEATURE VARS
wine.features <- subset(wine, select = -c(color) )
# FIND ROWS THAT ARE DUPLICATED
duplicates <- duplicated(wine.features) | duplicated(wine.features, fromLast = TRUE)
# UPDATE FEATURE DF
wine.features <- wine.features[ duplicates, ]
print(sprintf("Feature variables have %d duplicated rows", dim(wine.features)[1]))
## [1] "Feature variables have 2172 duplicated rows"
# ADD BACK THE LABELS
wine.features$color <- wine$color[duplicates]
# CHECK IF THERE ARE DUPLICATES
duplicates <- duplicated(wine.features) | duplicated(wine.features, fromLast = TRUE)
wine.features <- wine.features[ duplicates, ]
print(sprintf("Feature variables with labels have %d duplicated rows", dim(wine.features)[1]))
## [1] "Feature variables with labels have 2169 duplicated rows"
We can conclude that there are \(2172\) observations that share identical feature-variable values. \(2169\) of these also have matching labels (color), so \(2172 - 2169 = 3\) observations have identical feature variables but different colors. A Bayes rate model is therefore a perfect model that makes only \(3\) mistakes, since no model can distinguish examples with the same set of known facts but different outcomes.
The saturated model is calculated below.
mistakes <- 3
obs <- dim(wine)[1]
saturated.model <- (obs - mistakes) / obs
saturated.model
## [1] 0.9995379
Now let us visualise the comparisons.
# CONSTRUCT EVAL DF
eval.df <- data.frame(
Model = rep(c("decision tree 1", "decision tree 2", "decision tree 3", "binary logistic model"), each = 3),
Data = rep(c("Train", "Cal", "Test"), times = 4),
Value = c(decision.tree.1, decision.tree.2, decision.tree.3, log.pred)
)
# PLOT DF RESULTS
ggplot(eval.df,
# PLOT RESULTS
aes(x = factor(Model, levels = unique(Model)), y = Value, fill = factor(Data, levels = unique(Data)))) +
geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(y = 1.02, label = round(Value, digits = 3)), position = position_dodge(width = 1), size = 4) +
# PLOT NULL MODEL
geom_hline(yintercept = 0.7537, color = "black") +
annotate(geom = "text", x = 4.3, y = 0.77, label = "NULL Model", colour = "black") +
# PLOT SATURATED MODEL
geom_hline(yintercept = saturated.model, color = "black") +
annotate(geom = "text", x = 0.6, y = 0.99, label = "Saturated Model", colour = "black") +
# LABELS AND AESTHETICS
ylim(0, 1.2) +
ylab("Value") +
xlab("Model") +
labs(title="Model Evaluation") +
global_theme +
theme(legend.title = element_blank())
We can see that all our models greatly outperform the null model and sit very close to the saturated model. Decision tree 2 (trained on the pred-prefixed values) slightly outperforms the binary logistic regressor, but all models perform very well.
Let’s select decision tree 2 as our preferred model and analyse it further.
# GET PERFORMANCE MEASURES
performanceMeasures <- function(pred, truth, name = "model") {
# DEV
dev.norm <- -2 * logLikelihood(truth, pred)/length(pred)
# ACCURACY, PRECISION AND RECALL
ctable <- table(truth = truth==pos, pred = (pred > 0.5))
accuracy <- sum(diag(ctable)) / sum(ctable)
precision <- ctable[2, 2] / sum(ctable[, 2])
recall <- ctable[2, 2] / sum(ctable[2, ])
# F1 SCORE
f1 <- 2 * precision * recall / (precision + recall)
# RETURN DF WITH PERFORMANCE MEASURES
data.frame(model = name, precision = precision, recall = recall,f1 = f1, dev.norm = dev.norm)
}
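Note that performanceMeasures() relies on a logLikelihood() helper defined earlier in the report and not shown in this section. A minimal sketch for a binary outcome, assuming the truth argument has already been converted to a 0/1 or logical vector, might look like:

```r
# Hedged sketch: a minimal binary log-likelihood helper.
# y:  logical or 0/1 truth vector; py: predicted probabilities.
logLikelihood <- function(y, py) {
  # clip probabilities away from 0 and 1 to avoid log(0)
  py <- pmin(pmax(py, 1e-12), 1 - 1e-12)
  sum(y * log(py) + (1 - y) * log(1 - py))
}
```

The dev.norm quantity above is then the usual normalised deviance, \(-2\) times this log-likelihood divided by the number of predictions.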
# FORMATTER
panderOpt <- function() {
# SET PANDER OPTIONS
panderOptions("plain.ascii", TRUE)
panderOptions("keep.trailing.zeros", TRUE)
panderOptions("table.style", "simple")
}
# PRETTY PERFORMANCE TABLE FUNCTION
pretty_perf_table <- function(model, training, test) {
# SET PANDER OPTIONS
panderOpt()
perf_justify <- "lrrrr"
# COMPARING PERFORMANCE ON TRAIN VS TEST
pred_train<-predict(model,newdata=training)
truth_train <- training[,outcome]
pred_test<-predict(model,newdata=test)
truth_test <- test[,outcome]
trainperf_tree <- performanceMeasures(
pred_train,truth_train,"training")
testperf_tree <- performanceMeasures(
pred_test,truth_test, "test")
perftable <- rbind(trainperf_tree, testperf_tree)
pandoc.table(perftable, justify = perf_justify)
}
# PLOT ROC CURVE
plot_roc <- function(predcol1, outcol1, predcol2, outcol2) {
roc_1 <- rocit(score=predcol1,class=outcol1==pos)
roc_2 <- rocit(score=predcol2,class=outcol2==pos)
plot(roc_1, col = c("blue","green"), lwd = 3,
legend = FALSE,YIndex = FALSE, values = TRUE)
lines(roc_2$TPR ~ roc_2$FPR, lwd = 1, col = c("red","green"))
legend("bottomright", col = c("blue","red", "green"),
c("Test Data", "Training Data", "Null Model"), lwd = 2)
}
# PLOT ROC CURVE
plot_roc2 <- function(predcol1, outcol1, predcol2, outcol2, predcol3, outcol3, predcol4, outcol4) {
roc_1 <- rocit(score=predcol1,class=outcol1==pos)
roc_2 <- rocit(score=predcol2,class=outcol2==pos)
roc_3 <- rocit(score=predcol3,class=outcol3==pos)
roc_4 <- rocit(score=predcol4,class=outcol4==pos)
plot(roc_1, col = c("blue","green"), lwd = 3, legend = FALSE,YIndex = FALSE, values = TRUE)
lines(roc_2$TPR ~ roc_2$FPR, lwd = 1, col = c("red","green"))
lines(roc_3$TPR ~ roc_3$FPR, lwd = 1, col = c("orange","green"))
lines(roc_4$TPR ~ roc_4$FPR, lwd = 1, col = c("purple","green"))
legend("bottomright", col = c("blue","red", "orange", "purple", "green"),c("dtree 1", "dtree 2", "dtree 3", "binary logistic", "Null Model"), lwd = 2)
}
pretty_perf_table(tmodel2, balancedTrain, dTest)
##
##
## model precision recall f1 dev.norm
## ---------- ----------- -------- ------- ----------
## training 0.9836 0.9685 0.976 0.2086
## test 0.9500 0.9500 0.950 0.2058
pred_test_roc <- predict(tmodel2,newdata=dTest)
pred_train_roc <- predict(tmodel2,newdata=balancedTrain)
plot_roc(pred_test_roc, dTest[[outcome]], pred_train_roc, balancedTrain[[outcome]])
The above further validates our high-performing model: the ROC curves for both the test and training datasets show high AUC values. The performance table shows a small deviance and a high F1 score, indicating that both precision and recall are high.
For completeness, we will plot all the models’ ROC curves for comparison.
tmodel1_roc <- predict(tmodel1,newdata=dTest)
tmodel2_roc <- predict(tmodel2,newdata=dTest)
tmodel3_roc <- predict(tmodel3,newdata=dTest)
logistic_roc <- predict(gmodel,newdata=dTest)
plot_roc2(tmodel1_roc, dTest[[outcome]], tmodel2_roc, dTest[[outcome]], tmodel3_roc, dTest[[outcome]], logistic_roc, dTest[[outcome]])
We have found that classifying color is a fairly trivial task: a very simple model brings great results. Let’s now test our luck on a multi-class classification problem and predict wine quality. An interesting approach is to construct a regression model and round its output to the nearest whole number when making predictions.
First, let’s create a new test and train set.
# 80-20 SPLIT FOR TRAIN AND TEST
sampleSize <- round(nrow(wine)*0.8)
idx <- sample(seq_len(nrow(wine)), size = sampleSize) # sample row indexes across the full dataset
# CONSTRUCT SETS
X.train.reg <- wine[idx,]
X.test.reg <- wine[-idx,]
And now we can construct a simple linear regression model to fit the training data.
# CREATE OUR LINEAR REGRESSION MODEL
model.reg1 <- lm(quality ~ fixed.acidity + volatile.acidity + citric.acid + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol,
data = X.train.reg)
summary(model.reg1)
##
## Call:
## lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid +
## chlorides + free.sulfur.dioxide + total.sulfur.dioxide +
## density + pH + sulphates + alcohol, data = X.train.reg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5461 -0.4683 -0.0287 0.4822 2.5629
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.008e+01 6.218e+00 -3.229 0.001250 **
## fixed.acidity -2.440e-02 1.203e-02 -2.027 0.042671 *
## volatile.acidity -1.416e+00 8.111e-02 -17.453 < 2e-16 ***
## citric.acid -8.294e-02 8.792e-02 -0.943 0.345535
## chlorides -1.188e+00 3.476e-01 -3.418 0.000636 ***
## free.sulfur.dioxide 7.358e-03 8.860e-04 8.305 < 2e-16 ***
## total.sulfur.dioxide -1.787e-03 3.004e-04 -5.949 2.88e-09 ***
## density 2.269e+01 6.288e+00 3.608 0.000312 ***
## pH -2.265e-02 7.999e-02 -0.283 0.777047
## sulphates 6.855e-01 7.964e-02 8.607 < 2e-16 ***
## alcohol 3.616e-01 1.376e-02 26.274 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.741 on 5183 degrees of freedom
## Multiple R-squared: 0.2997, Adjusted R-squared: 0.2983
## F-statistic: 221.8 on 10 and 5183 DF, p-value: < 2.2e-16
The model summary shows that several predictor variables are insignificant and the adjusted R-squared is low. Let’s continue inspecting our first model.
X.test.reg$predQuality <- predict(model.reg1,newdata=X.test.reg)
X.train.reg$predQuality <- predict(model.reg1,newdata=X.train.reg)
p<- ggplot(data=X.test.reg) +
geom_point(aes(x=predQuality,y=quality),alpha=0.2,color="black") +
geom_smooth(aes(x=predQuality,y=quality),color="black") +
geom_line(aes(x=quality,y=quality),color="blue",linetype=2) +
scale_x_continuous(limits=c(4,8)) +
scale_y_continuous(limits=c(4,8)) +
global_theme
p
The black line indicates the average relation between the predicted and actual quality and the blue line indicates the ideal relationship.
One way to improve our linear model is to check for outliers, in particular high-leverage, high-influence points. We will use four diagnostic plots: Residuals vs Fitted, Normal Q-Q, Cook’s Distance, and Residuals vs Leverage.
par(mfrow=c(2,2))
plot(model.reg1, which = 1, cook.levels = c(0.05, 0.1))
plot(model.reg1, which = 2, cook.levels = c(0.05, 0.1))
plot(model.reg1, which = 4, cook.levels = c(0.05, 0.1))
plot(model.reg1, which = 5, cook.levels = c(0.05, 0.1))
Let’s now remove these identified outliers and see if we have an improved model.
# OUTLIER IDX
to.rm <- c(4582,3707,2423,2143,4955,4933)
# REMOVE OUTLIERS
X.train.reg <- X.train.reg[-to.rm,]
Now let’s make another model.
model.reg2 <- lm(quality ~ fixed.acidity + volatile.acidity + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + sulphates + alcohol,
data = X.train.reg)
summary(model.reg2)
##
## Call:
## lm(formula = quality ~ fixed.acidity + volatile.acidity + chlorides +
## free.sulfur.dioxide + total.sulfur.dioxide + density + sulphates +
## alcohol, data = X.train.reg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5283 -0.4647 -0.0314 0.4832 2.5708
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.922e+01 6.000e+00 -3.203 0.001369 **
## fixed.acidity -2.630e-02 9.801e-03 -2.683 0.007309 **
## volatile.acidity -1.385e+00 7.265e-02 -19.064 < 2e-16 ***
## chlorides -1.233e+00 3.422e-01 -3.602 0.000319 ***
## free.sulfur.dioxide 7.368e-03 8.858e-04 8.318 < 2e-16 ***
## total.sulfur.dioxide -1.809e-03 2.874e-04 -6.294 3.36e-10 ***
## density 2.176e+01 5.993e+00 3.630 0.000286 ***
## sulphates 6.823e-01 7.819e-02 8.726 < 2e-16 ***
## alcohol 3.592e-01 1.333e-02 26.939 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.741 on 5179 degrees of freedom
## Multiple R-squared: 0.2996, Adjusted R-squared: 0.2985
## F-statistic: 276.9 on 8 and 5179 DF, p-value: < 2.2e-16
Let’s compare performance measures between both models to identify if there is any improvement.
reg.performance <- function(model, y_pred, y_true) {
adj.r.sq <- summary(model)$adj.r.squared
mse <- MSE(y_pred, y_true)
rmse <- RMSE(y_pred, y_true)
mae <- MAE(y_pred, y_true)
print(paste0("Adjusted R-squared: ", round(adj.r.sq, 4)))
print(paste0("MSE: ", round(mse, 4)))
print(paste0("RMSE: ", round(rmse, 4)))
print(paste0("MAE: ", round(mae, 4)))
}
# MODEL 1 PERFORMANCE
reg.performance(model = model.reg1, y_pred = model.reg1$fitted.values, y_true = X.train.reg$quality)
## [1] "Adjusted R-squared: 0.2983"
## [1] "MSE: 0.8319"
## [1] "RMSE: 0.9121"
## [1] "MAE: 0.7176"
# MODEL 2 PERFORMANCE
reg.performance(model = model.reg2, y_pred = model.reg2$fitted.values, y_true = X.train.reg$quality)
## [1] "Adjusted R-squared: 0.2985"
## [1] "MSE: 0.5481"
## [1] "RMSE: 0.7403"
## [1] "MAE: 0.5771"
It appears that the linear model trained on the dataset without outliers outperforms the first model, as seen in the drop in the error metrics. However, the adjusted R-squared barely changes between models, indicating the fit remains poor. Let’s finish this section by plotting the best model.
X.test.reg$predQuality <- predict(model.reg2,newdata=X.test.reg)
X.train.reg$predQuality <- predict(model.reg2,newdata=X.train.reg)
# MODEL 2 PERFORMANCE ON TRAIN
p1 <- ggplot(data=X.train.reg) +
geom_point(aes(x=predQuality,y=quality),alpha=0.2,color="black") +
geom_smooth(aes(x=predQuality,y=quality),color="black") +
geom_line(aes(x=quality,y=quality),color="blue",linetype=2) +
scale_x_continuous(limits=c(4,8)) +
scale_y_continuous(limits=c(4,8)) +
labs(title = "Train Data") +
global_theme
# MODEL 2 PERFORMANCE ON TEST
p2 <- ggplot(data=X.test.reg) +
geom_point(aes(x=predQuality,y=quality),alpha=0.2,color="black") +
geom_smooth(aes(x=predQuality,y=quality),color="black") +
geom_line(aes(x=quality,y=quality),color="blue",linetype=2) +
scale_x_continuous(limits=c(4,8)) +
scale_y_continuous(limits=c(4,8)) +
labs(title = "Test Data") +
global_theme
grid.arrange(p1, p2, ncol=2)
The linear regression model has been examined and improved, yet it remains ineffective at modelling the training dataset, as seen in the plot above. The best adjusted R-squared value produced is only \(0.2985\). This poor performance on the training set is reflected in the poor performance on the test set. It is evident that a linear regression model is not well suited to this data.
It was mentioned earlier that perhaps we care more about whether a wine is low quality versus medium quality, as opposed to quality \(6\) versus quality \(7\). Let’s have a look at the raw errors of our preferred model’s predictions. Remember, we round these predictions to the nearest whole number to give a categorical score between \(1\) and \(10\), as previously described.
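Concretely, the rounding step can be sketched on toy data; the vectors below are stand-ins for X.test.reg$quality and the model’s predictions, since the fitted objects live elsewhere in the report:

```r
# Hedged sketch on synthetic data: rounding regression predictions to the
# nearest whole number turns the regressor into a 1-10 classifier.
quality <- c(5, 6, 7, 5, 4)            # stand-in for X.test.reg$quality
pred    <- c(5.2, 6.6, 6.1, 4.9, 4.4)  # stand-in for the model's predictions
pred.class <- round(pred)
exact.acc  <- mean(pred.class == quality)           # exact-match accuracy
within.one <- mean(abs(pred.class - quality) <= 1)  # within one quality point
```

Scoring both exact matches and within-one matches reflects the idea that being off by a single quality point may be acceptable.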
# DISPLAY RAW ERRORS - AFTER ROUNDING PREDICTION
raw.errors <- round(model.reg2$residuals, 0)
qplot(raw.errors,geom="bar") +
global_theme
It is evident that most of the mistakes the model makes are within a margin of \(1\): most misclassifications are one quality point above or below the expected outcome. The confidence interval of the approximate normal distribution verifies this conjecture.
mu <- mean(raw.errors)
sigma <- sd(raw.errors)
z <- 1.645 # 90% confidence interval
confidence_interval(mu, sigma, z)
## [1] -1.313802 1.345799
Whether this model is actually viable depends on the error tolerance we are willing to accept.
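For reference, confidence_interval() used above is a helper defined earlier in the report and not shown in this section. Given the usage, a minimal normal-approximation version might look like:

```r
# Hedged sketch: normal-approximation confidence interval around a mean.
# mu: sample mean; sigma: sample standard deviation; z: critical value
# (e.g. 1.645 for a 90% interval).
confidence_interval <- function(mu, sigma, z) {
  c(mu - z * sigma, mu + z * sigma)
}
```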
We now wish to explore the relationships between different wine qualities of both red and white wine according to their alcohol content. We will rework the original dataset so that wine qualities are categorised into low, medium and high, and group these observations by color (e.g. “high.red” or “med.white”).
We are going to work with the original dataset.
head(wine)
Let’s first bin our quality variable into low, medium and high categories.
c.df <- wine
# REBIN QUALITY FEATURE
c.df$quality.cat <- cut(c.df$quality, breaks=seq(0, 10, 3), labels=c("low", "medium", "high"))
str(c.df)
## tibble [6,492 × 16] (S3: tbl_df/tbl/data.frame)
## $ fixed.acidity : num [1:6492] 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num [1:6492] 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num [1:6492] 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num [1:6492] 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num [1:6492] 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num [1:6492] 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num [1:6492] 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num [1:6492] 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num [1:6492] 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num [1:6492] 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num [1:6492] 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : num [1:6492] 5 5 5 6 5 5 5 7 7 5 ...
## $ good : num [1:6492] 0 0 0 0 0 0 0 1 1 0 ...
## $ color : Factor w/ 2 levels "red","white": 1 1 1 1 1 1 1 1 1 1 ...
## $ quality.factor : Factor w/ 7 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...
## $ quality.cat : Factor w/ 3 levels "low","medium",..: 2 2 2 2 2 2 2 3 3 2 ...
Now let’s bind wine color to the quality factor to create a combined grouping feature.
c.df <- c.df %>% mutate(quality.color = factor(paste0(quality.factor, ".", color)))
clust.df <- c.df %>%
group_by(quality.color) %>%
subset( select = c(quality.color, alcohol, volatile.acidity) )
set.seed(42)
clust.df <- clust.df[sample(nrow(clust.df), 60), ]
It is important to notice that our data has differing units. For example, residual sugar is measured in \(\frac{g}{dm^3}\) and free sulfur dioxide is measured in \(\frac{mg}{dm^3}\). Disparity in units will affect which clusters an algorithm discovers, so it is important to standardise our numerical variables so that all features contribute equally. It may be interesting to see the results when converting all variables into SI units, but we will keep things simple and standardise.
We will use Euclidean distance for our clustering.
vars.to.use <- colnames(clust.df[-1])
pmatrix <- scale(clust.df[, vars.to.use])
pcenter <- attr(pmatrix, "scaled:center")
pscale <- attr(pmatrix, "scaled:scale")
d <- dist(pmatrix, method = "euclidean")
Before performing our clustering, we need to find the optimal value of k, the number of clusters we model. It can be derived from the average distance from each point to the centre of its cluster and the average distance between clusters.
sqr_edist <- function(x, y) {
sum((x - y) ^ 2)
}
wss.cluster <- function(clustermat) {
c0 <- apply(clustermat, 2, FUN = mean)
sum(apply(clustermat, 1,
FUN = function(row) {sqr_edist(row, c0)}))
}
wss.total <- function(dmatrix, labels) {
wsstot <- 0
k <- length(unique(labels))
for(i in 1:k){
wsstot <- wsstot +
wss.cluster(subset(dmatrix, labels == i))
}
wsstot
}
totss <- function(dmatrix) {
grandmean <- apply(dmatrix, 2, FUN = mean)
sum(apply(dmatrix, 1,
FUN=function(row) {
sqr_edist(row, grandmean)
}
)
)
}
The Calinski-Harabasz index (CH) is defined as the ratio of inter-cluster variance to the total intra-cluster variance and is a popular metric for selecting the optimal value for k.
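In symbols, for \(n\) points partitioned into \(k\) clusters, with \(BSS(k)\) the between-cluster sum of squares and \(WSS(k)\) the total within-cluster sum of squares:

\[ CH(k) = \frac{BSS(k)/(k-1)}{WSS(k)/(n-k)} \]

This corresponds to the crit.num and crit.denom terms computed in the function below; we pick the k that maximises the index.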
Let’s create a function to compute this metric.
rm(var) # remove any earlier variable named var so it does not mask stats::var
ch_criterion <- function(dmatrix, kmax, method = "kmeans") {
if(!(method %in% c("kmeans", "hclust"))){
stop("method must be one of c('kmeans', 'hclust')")
}
npts <- nrow(dmatrix)
totss <- totss(dmatrix)
wss <- numeric(kmax)
crit <- numeric(kmax)
wss[1] <- (npts - 1) * sum(apply(dmatrix, 2, FUN = var))
for(k in 2:kmax) {
if(method == "kmeans") {
clustering <- kmeans(dmatrix, k, nstart = 10, iter.max = 100)
wss[k] <- clustering$tot.withinss
}else {
d <- dist(dmatrix, method = "euclidean")
pfit <- hclust(d, method = "ward.D2")
labels <- cutree(pfit, k = k)
wss[k] <- wss.total(dmatrix, labels)
}
}
bss <- totss - wss
crit.num <- bss / (0:(kmax - 1))
crit.denom <- wss/(npts - 1:kmax)
list(crit = crit.num / crit.denom, wss = wss, bss = bss, totss = totss)
}
Now let’s plot the CH index over a range of different k values.
clustcrit <- ch_criterion(pmatrix, 12, method = "hclust")
critframe <- data.frame(k = 1:12, ch = scale(clustcrit$crit), wss = scale(clustcrit$wss), bss = scale(clustcrit$bss))
critframe <- melt(critframe, id.vars = c("k"), variable.name = "measure", value.name = "score")
ggplot(critframe, aes(x = k, y = score, color = measure)) +
geom_point(aes(shape = measure), size = 2) +
geom_line(aes(linetype = measure), size = 1) +
scale_x_continuous(breaks = 1:12, labels = 1:12) +
global_theme
Typically, we look for the point where the benefit of additional clusters slows down and select it as the value of k; visually, this appears as a distinct ‘elbow’ in the plot above. Three or four look promising. Let’s compute the best value using R: NbClust reports the optimal value of k across \(30\) different clustering indexes.
This algorithm has poor time complexity. One way to work around this is to run it on a representative sample of the data set, repeating the algorithm a number of times on different samples so that we do not miss any fine detail in the population.
# FIND OPTIMAL K USING NbClust
library(NbClust)
library(factoextra)
n <- 100
epoch <- 5
kVals <- c()
# CREATE epoch DIFFERENT SAMPLES TO SEE IF CONSENSUS ON k IS MET
for (i in 1:epoch) {
smple_ <- sample_n(clust.df, n, replace = TRUE)
k.select_ <- smple_ %>% subset( select = -c(quality.color) ) %>% NbClust(method = "ward.D2", index = "all", min.nc = 2, max.nc = 15)
kVals <- c( kVals, length(unique(k.select_$Best.partition)) )
}
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 4 proposed 2 as the best number of clusters
## * 10 proposed 3 as the best number of clusters
## * 1 proposed 10 as the best number of clusters
## * 3 proposed 13 as the best number of clusters
## * 5 proposed 15 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 4 proposed 2 as the best number of clusters
## * 12 proposed 3 as the best number of clusters
## * 2 proposed 9 as the best number of clusters
## * 1 proposed 14 as the best number of clusters
## * 4 proposed 15 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 4 proposed 2 as the best number of clusters
## * 10 proposed 3 as the best number of clusters
## * 1 proposed 7 as the best number of clusters
## * 1 proposed 11 as the best number of clusters
## * 2 proposed 13 as the best number of clusters
## * 1 proposed 14 as the best number of clusters
## * 4 proposed 15 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 4 proposed 2 as the best number of clusters
## * 11 proposed 3 as the best number of clusters
## * 2 proposed 9 as the best number of clusters
## * 1 proposed 13 as the best number of clusters
## * 1 proposed 14 as the best number of clusters
## * 4 proposed 15 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 4 proposed 2 as the best number of clusters
## * 12 proposed 3 as the best number of clusters
## * 2 proposed 13 as the best number of clusters
## * 1 proposed 14 as the best number of clusters
## * 4 proposed 15 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
Let’s view the results from the final sample created in the loop above.
# FIND NbClust BEST NUMBER OF CLUSTERS
fviz_nbclust(k.select_, barfill = "#00BFC4", barcolor = "#00BFC4") +
theme_gray() +
ggtitle("NbClust's optimal number of clusters")
## Among all indices:
## ===================
## * 2 proposed 0 as the best number of clusters
## * 1 proposed 1 as the best number of clusters
## * 4 proposed 2 as the best number of clusters
## * 12 proposed 3 as the best number of clusters
## * 2 proposed 13 as the best number of clusters
## * 1 proposed 14 as the best number of clusters
## * 4 proposed 15 as the best number of clusters
##
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is 3 .
We can see that the majority of clustering indexes selected \(3\) as the optimal value of k for every sample. The below summarises the results.
kVals
## [1] 3 3 3 3 3
Now let’s check the stability of these clusters.
# BEST k
kbest.p <- 3
# RUNNING CLUSTERBOOT
cboot.hclust <- clusterboot(pmatrix,
clustermethod = hclustCBI,
method = "ward.D",
k = kbest.p,
count=FALSE)
# CHECK STABILITY
groups <- cboot.hclust$result$partition
1 - cboot.hclust$bootbrd / 100
## [1] 0.97 1.00 0.84
We can see that the clusters are stable (the first two highly so, the third reasonably so), so we can confirm that \(3\) is the best value of k.
The clustering methodology we will use is hierarchical clustering. The dendrogram below displays each of our \(60\) sampled observations, labelled by their quality/color group, in their respective clusters.
# CONSTRUCT DENDROGRAM
pfit <- hclust(d, method="ward.D2")
plot(pfit, labels = clust.df$quality.color)
rect.hclust(pfit, k = kbest.p)
Let’s plot the clusters for better visualisation using principal component transformation.
# PRINCIPAL COMPONENT TRANSFORMATION
groups <- cutree(pfit, k = kbest.p)
princ <- prcomp(pmatrix)
nComp <- 2
project <- as.data.frame(predict(princ, newdata=pmatrix)[,1:nComp])
project.plus <- cbind(project, cluster=as.factor(groups), Group=clust.df$quality.color)
h <- do.call(
rbind,
lapply(
unique(groups),
function(c) {
f <- subset(project.plus, cluster == c);
f[chull(f), ]
}
)
)
# VISUALISE RESULTS
ggplot(project.plus, aes(x = PC1, y = PC2)) +
geom_point(aes(shape = cluster, color = cluster)) +
geom_text(aes(label = Group, color = cluster),
hjust = 0, vjust = 1, size = 3) +
geom_polygon(data = h, aes(group = cluster, fill = as.factor(cluster)),
alpha = 0.4, linetype = 0) +
theme(legend.position = "") +
global_theme
The above indicates three distinct clusters with no overlap. It is interesting to see that the various wine quality/color groups cluster together to some degree. The blue cluster consists of almost all quality \(5\) wines, the green cluster presents mostly high-quality white wines, and the red cluster has many mid-to-high-level wines that are predominantly white. This suggests that a relationship does exist between these groups and the alcohol content and volatile acidity of wine.
It should be mentioned that a small representative sample of \(60\) instances is used to produce the above clusters. I used different seed values to produce clusters from other representative samples and can conclude that the result above generalises well and is representative of the entire data set. This experimentation is omitted from this final report.
In this exploration and analysis of Wine Quality data set we faced a number of challenges:
These challenges reinforce the need for care when modelling data, especially when the resulting solutions affect the domain in some way. Our analysis did uncover valuable relationships in the chemistry of wine; for example, better quality wines tend to have a higher alcohol content. This makes me question how the quality of wine is assessed: is there a bias toward how alcohol affects the decision of a wine being good or bad? The analysis also shows that not all chemical properties are valuable; feature-selection techniques give similar or better results with fewer variables.
It would be interesting to further investigate which combination of chemical properties create the best wines. This could aid wine makers in raising the quality of their wines.